Notes:
library(ggplot2)
pf <- read.csv("../lesson3/pseudo_facebook.tsv", sep = "\t")
qplot(age, friend_count, data = pf)
Response: Most people have a small number of friends on Facebook, while some young users have a very large number of friends.
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point() +
xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha = 1 / 20) +
xlim(13, 90)
## Warning: Removed 5154 rows containing missing values (geom_point).
Response: Most users have fewer than 500 friends.
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1 / 20, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 5173 rows containing missing values (geom_point).
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1 / 20) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
The black bars look higher.
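One plausible reason for `position_jitter(h = 0)` in the first plot above: with a square-root y axis, vertical jitter could push a friend count of zero below zero, and the square root of a negative number is undefined. The sketch below uses made-up values (the `friend_count` and `offsets` vectors are illustrative assumptions, not taken from `pf`).

```r
# Illustrative stand-in for friend_count: several users with zero friends.
friend_count <- c(0, 0, 0, 5, 20)

# Hypothetical vertical jitter offsets (fixed here so the result is reproducible).
offsets <- c(-0.3, 0.2, -0.1, 0.4, -0.2)
jittered <- friend_count + offsets

# Some jittered zeros are now negative, so sqrt() returns NaN for them.
transformed <- suppressWarnings(sqrt(jittered))
any_nan <- any(is.nan(transformed))

# Without vertical jitter (the h = 0 case) every value stays valid.
all_ok <- !any(is.nan(sqrt(friend_count)))

print(transformed)
print(c(any_nan = any_nan, all_ok = all_ok))
```

Horizontal jitter is still safe here because age is far from zero and the x axis is not transformed.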
Notes:
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point(alpha = 1 / 10, position = position_jitter(h = 0)) +
coord_trans(y = "sqrt")
Notes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n())
head(pf.fc_by_age)
## # A tibble: 6 x 4
## age friend_count_mean friend_count_median n
## <int> <dbl> <dbl> <int>
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
Create your plot!
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
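As a cross-check on the grouped summary, the same three statistics can be computed in base R with `tapply`. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not rows from `pf`).

```r
# Toy data: two ages, a handful of users each.
toy <- data.frame(
  age = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 0, 50, 100)
)

# One statistic per call; tapply splits friend_count by age.
means   <- tapply(toy$friend_count, toy$age, mean)
medians <- tapply(toy$friend_count, toy$age, median)
counts  <- tapply(toy$friend_count, toy$age, length)

# Assemble a table with the same column names as pf.fc_by_age.
summary_tbl <- data.frame(
  age = as.numeric(names(means)),
  friend_count_mean = as.vector(means),
  friend_count_median = as.vector(medians),
  n = as.vector(counts)
)

print(summary_tbl)
```

The dplyr pipeline is usually easier to read and extend; the base R version is mainly useful for verifying a result by a different route.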
Notes:
# Note: ggplot2 3.3.0 and later rename the fun.y argument to fun.
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1 / 20, position = position_jitter(h = 0),
             color = "orange") +
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  # Mean (solid black), 10th and 90th percentiles (dashed blue), median (solid blue)
  geom_line(stat = "summary", fun.y = mean) +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1),
            linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9),
            linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = median, color = "blue")
Response: All lines except the 10% quantile have similar trends. The median is smaller than the mean.
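To see what the summary lines compute at each age, and why the median can sit well below the mean, here is a small sketch on invented, right-skewed data (the `toy` data frame is an assumption for illustration). The `probs` argument is forwarded to `quantile` much as `fun.args` forwards it in the plot.

```r
# Toy data: age 20 has one user with a very large friend count.
toy <- data.frame(
  age = c(20, 20, 20, 20, 20, 30, 30, 30),
  friend_count = c(0, 10, 20, 30, 1000, 5, 5, 5)
)

# Extra arguments to tapply are passed on to the summary function.
p10 <- tapply(toy$friend_count, toy$age, quantile, probs = 0.1)
p90 <- tapply(toy$friend_count, toy$age, quantile, probs = 0.9)
med <- tapply(toy$friend_count, toy$age, median)
avg <- tapply(toy$friend_count, toy$age, mean)

result <- data.frame(
  age = as.numeric(names(avg)),
  p10 = as.vector(p10),
  median = as.vector(med),
  mean = as.vector(avg),
  p90 = as.vector(p90)
)

print(result)
```

For age 20 the single large value pulls the mean far above the median, which is the same pattern the response describes for the real data.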
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
cor.test(pf$friend_count, pf$age)
##
## Pearson's product-moment correlation
##
## data: pf$friend_count and pf$age
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
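Rather than reading the coefficient off the printed output, it can be pulled from the object that `cor.test` returns, whose `estimate` element holds the sample correlation. A minimal sketch on made-up vectors (`x` and `y` are illustrative, not columns of `pf`):

```r
# Two small illustrative vectors.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# cor.test returns a list-like "htest" object.
result <- cor.test(x, y)

# estimate is a named numeric; unname() drops the "cor" label.
r <- unname(result$estimate)
r_rounded <- round(r, 3)

print(r_rounded)
print(result$conf.int)  # 95 percent confidence interval
```

The same pattern applies to `cor.test(pf$friend_count, pf$age)`.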
Notes:
with(subset(pf, age <= 70), cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Notes:
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point() +
coord_cartesian(xlim = c(0, 4e4), ylim = c(0, 1e5))
Notes:
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method = "lm", color = "red")
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
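The warnings above arise because `xlim()` and `ylim()` discard rows outside the limits before any statistic is computed, so the `lm` smoother is fit to the trimmed data only; `coord_cartesian()`, by contrast, only zooms the view. The base R sketch below mimics that trimming on invented data (`x`, `y`, and the single outlier are assumptions for illustration).

```r
# Toy data: a perfect line plus one extreme outlier.
x <- c(1:19, 1000)
y <- c(2 * (1:19), 500)

# 95th-percentile caps, as used for the axis limits.
x_cap <- quantile(x, 0.95)
y_cap <- quantile(y, 0.95)

# Rows inside both limits are kept; the rest are what ggplot2 reports as removed.
keep <- x <= x_cap & y <= y_cap
n_removed <- sum(!keep)

# Correlation with and without the trimmed rows.
r_all <- cor(x, y)
r_trimmed <- cor(x[keep], y[keep])

print(n_removed)
print(c(all = r_all, trimmed = r_trimmed))
```

This is why a question that asks to include the top 5% of values has to be answered on the full data frame rather than from the trimmed plot.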
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
with(pf, cor.test(www_likes_received, likes_received))
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948
Notes:
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
data("Mitchell")
Create your plot!
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point()
Take a guess for the correlation coefficient for the scatterplot. Response: 0
What is the actual correlation of the two variables? (Round to the thousandths place.) Response: 0.057
with(Mitchell, cor.test(Month, Temp))
##
## Pearson's product-moment correlation
##
## data: Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes:
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point() +
scale_x_continuous(breaks = seq(0, 203, 12))
What do you notice? Response: The temperature follows a yearly cycle, rising and falling with a 12-month period.
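A near-zero Pearson correlation does not mean there is no relationship, only that there is no linear one. The sketch below uses a synthetic sinusoid in place of the Mitchell data (the `month` and `temp` vectors and the amplitude of 10 are assumptions for illustration) and folds the months onto a single year with `%%`.

```r
# Synthetic stand-in: 204 months of a purely seasonal signal.
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12)

# The linear correlation with the month index is close to zero ...
r <- cor(month, temp)

# ... but folding onto month-of-year exposes the cycle.
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)
seasonal_range <- max(seasonal_mean) - min(seasonal_mean)

print(round(r, 3))
print(round(seasonal_mean, 2))
```

With the real data, plotting `Temp` against `Month %% 12` would overlay all years on one 12-month axis in the same way.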
Notes:
pf$age_with_months <- with(pf, age + 1 - dob_month / 12)
pf.fc_by_age_months <- pf %>%
group_by(age_with_months) %>%
summarise(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age_with_months)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
## age_with_months friend_count_mean friend_count_median n
## <dbl> <dbl> <dbl> <int>
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
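A worked example of the `age_with_months` formula may help. The interpretation assumed here is that ages are measured at year end, so a user born in month `m` has lived `12 - m` months past their last birthday; the input values are invented for illustration.

```r
# Three hypothetical users, all aged 36, born in January, March, and December.
age <- c(36, 36, 36)
dob_month <- c(1, 3, 12)

# Same expression as in the lesson: age + 1 - dob_month / 12.
age_with_months <- age + 1 - dob_month / 12

print(data.frame(age, dob_month, age_with_months))
```

A January birthday gives 36 + 11/12, a March birthday 36.75, and a December birthday exactly 36. The smallest value in the table above, 13.16667, corresponds to age 13 with `dob_month` equal to 10.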
Programming Assignment
age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)
head(pf.fc_by_age_months2)
## # A tibble: 6 x 4
## age_with_months friend_count_mean friend_count_median n
## <dbl> <dbl> <dbl> <int>
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
ggplot(aes(x = age_with_months, y = friend_count_mean),
data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()
Notes:
p1 <- ggplot(aes(x = age, y = friend_count_mean),
             data = subset(pf.fc_by_age, age < 71)) +
  geom_line() +
  geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
             data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count),
             data = subset(pf, age < 71)) +
  geom_line(stat = "summary", fun.y = mean)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, p3)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
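The third panel bins ages with `round(age / 5) * 5`, which snaps each age to the nearest multiple of five before averaging. A small sketch of that binning on invented values (the `toy` data frame is an assumption, not part of `pf`):

```r
# Toy data: five users with integer ages.
toy <- data.frame(
  age = c(13, 17, 18, 22, 23),
  friend_count = c(100, 300, 200, 400, 50)
)

# Snap each age to the nearest multiple of 5.
toy$age_bin <- round(toy$age / 5) * 5

# Mean friend count per bin, as the summary line in p3 would compute.
bin_means <- tapply(toy$friend_count, toy$age_bin, mean)

print(toy)
print(bin_means)
```

Wider bins average over more users, which gives a smoother but coarser line; the one-month bins in the second panel sit at the opposite end of that trade-off.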
Notes:
Reflection: When a variable takes discrete values, such as integer ages, jitter and transparency help reveal points that would otherwise be overplotted.